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In generalized linear regression problems with an abundant num- 
ber of features, lasso-type regularization which imposes an I -constraint 
on the regression coefficients has become a widely established tech- 
nique. Deficiencies of the lasso in certain scenarios, notably strongly 
correlated design, were unmasked when Zou and Hastie [J. Roy. 
Statist. Soc. Ser. B 67 (2005) 301-320] introduced the elastic net. 
In this paper we propose to extend the elastic net by admitting gen- 
eral nonnegative quadratic constraints as a second form of regular- 
ization. The generalized ridge-type constraint will typically make use 
of the known association structure of features, for example, by using 
temporal- or spatial closeness. 

We study properties of the resulting "structured elastic net" re- 
gression estimation procedure, including basic asymptotics and the 
issue of model selection consistency. In this vein, we provide an analog 
to the so-called "irrepresentable condition" which holds for the lasso. 
Moreover, we outline algorithmic solutions for the structured elastic 
net within the generalized linear model family. The rationale and the 
performance of our approach is illustrated by means of simulated and 
real world data, with a focus on signal regression. 

1. Introduction. We consider regression problems with a linear 
predictor. Let X = (X\, . . . , X p ) T be a random vector of real- valued fea- 
tures/predictors and let Y be a random response variable taking values in a 
set y. Given a realization x = (x\ x p ) T of X, a prediction y for a specific 
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functional of the distribution of Y\X. = x is obtained via a linear predictor 

/(x; /3 , /3) = fa + x T /3, /3 = (/3i,..., /3 P ) T , 

and a function £:R — >y such that y = £(/(x)). Given an i.i.d. sample 
S 1 = {(xj, i/i)}™ =1 from (R p x y) n , an optimal set of coefficients /3q, /3 = 
. . . ,/3 P ) T can be determined by minimization of a criterion of the form 

n 

(1) (Aj,3) = argminVL(y i ,/(x i ;/3 ,/3)), 

(A>,/3) i=1 

where Ii^xK-^ Rq" is a convex loss function. The loss function is chosen 
according to the specific prediction problem, so that large loss represents 
bad fit to the observed sample S. Approach (1) usually yields poor esti- 
mates /3o,{3 if n is not one order of magnitude larger than p. In particular, 
if p> n, approach (1) is not well-defined in the sense that there exist in- 
finitely many minimizers Po,P- One way to cope with a small n/p ratio is to 
employ a regularizer f2(/3). A traditional approach due to Hoerl and Ken- 
nard (1970) minimizes the loss in equation (1) subject to an £ 2 -constraint 
on (3. In the situation that (3 is supposed to be sparse, Tibshirani (1996) 
proposed, under the acronym "lasso," to work with an I 1 -constraint, that 
is, one maximizes the loss subject to fi(/3) = ||/3||i < s,s > 0. The latter is 
particularly attractive if one is interested in feature selection, since one ob- 
tains estimates /3j,j G {1, . . . ,p}, which equal exactly zero, such that feature 
j does not contribute to prediction, for which we say that feature j is "not 
selected." Continuous shrinkage [Fan and Li (2001)] and the existence of effi- 
cient algorithms [Efron et al. (2004), Genkin, Lewis and Madigan (2007)] for 
determining the coefficients are further virtues of the lasso. Its limitations 
have recently been revealed by several researchers. Zou and Hastie (2005) 
pointed out that the lasso need not be unique in the p> n setting, where 
the lasso is able to select at most n features [Rosset, Zhu and Hastie (2004)]. 
Furthermore, Zou and Hastie stated that the lasso does not distinguish be- 
tween "irrelevant" and "relevant but redundant" features. In particular, if 
there is a group of correlated features, then the lasso tends to select one 
arbitrary member of the group while ignoring the remainder. The combined 
regularizer of the elastic net Q((3) = a>\\(3\\i + (1 — a)||/3|| 2 , a G (0, 1) is shown 
to provide remedy in this regard. 

A second double regularizer — tailored to one-dimensional signal regression — 
is employed by the fused lasso [Tibshirani et al. (2005)], who propagate 
fi(/3) = all/311! + (1 - a)||D/3||i, where 

D: R p ^R p -\ 

(2) 

(Pi,-- -,P P ) T ^ ([& -Pi],...,[/3 P - /3 P _!]) T 
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is the first forward difference operator. The total variation regularizer is 
meaningful whenever there is an order relation, notably a temporal one, 
among the features. The fused lasso has a property which can be beneficial 
for interpretation: it automatically clusters the features, since the sequence 
j3\, . . . , f3 p is blockwise constant. 

In this paper we study a regularizer which is intermediate between the 
elastic net and the fused lasso. Our regularizer combines an i 1 -constraint 
with a quadratic form: 

(3) fi(/3) = aPIK + (l-a)/3 T A/3, 

where A = (ljj')i<j,j'< P is assumed to be symmetric and positive semidefi- 
nite. Setting A = I yields the elastic net. The inclusion of A aims at captur- 
ing the a priori association structure (if available) of the features in more 
generality than the fused lasso. Therefore, expression (3) will be referred to 
as the structured elastic net regularizer. The structured elastic net estimator 
is defined as 

n 

00, 3) =argminy]L(y i ,/(x i ;/3o,/3)) 

(A>,/3) i= i 

(4) 

subject to a||/3||i + (l-a)/3' A/3 < s, a G (0, 1), s > 0, 
which is equivalent to the Lagrangian formulation 

n 

(A), 3) = argmin VLfe, /(x l5 /? ,/3)) + Ai||/3||i + A 2 /3 T A/3, 
i=i 

Ai,A 2 >0. 

The rest of the paper is organized as follows: in Section 2 we discuss the 
choice of the matrix A, followed by an analysis of some important proper- 
ties of our proposal (5) in Section 3. Section 4 is devoted to asymptotics and 
consistency questions, motivating the introduction of the adaptive struc- 
tured elastic net. Section 5 presents an algorithmic solution to compute the 
minimizers (5) in the generalized linear model family. The practical per- 
formance of the structured elastic net is contained in Section 6. Section 
7 concludes with a discussion and an outlook. All proofs can be found in 
the online supplement supporting this article [Slawski, zu Castell and Tutz 
(2010)]. 
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2. Structured features. 

2.1. Motivation. A considerable fraction of contemporary regression prob- 
lems are characterized by a large number of features, which are either of the 
same order of magnitude as the sample size or even several orders larger 
(p^>n). Common instances thereof are feature sets consisting of sampled 
signals, pixels of an image, spatially sampled data, or gene expression inten- 
sities. Beside high dimensionality of the feature space, these examples have 
in common that the feature set can be arranged according to an a priori as- 
sociation structure. If a sampled signal does not vary rapidly, the influence 
of nearby sampling points on the response can be expected to be similar; 
correspondingly, this applies to adjacent pixels of an image, or, more gen- 
erally, to any other form of spatially linked features. In genomics genes can 
be categorized into functional groups, or one has prior knowledge of their 
functions and interactions within biochemical reaction chains, the so-called 
pathways. 

Figures 1 and 2 display two well-known examples, phoneme- and hand- 
written digit classification. These examples are well apt to illustrate the idea 
of the structured elastic net regularizer, since it is sensible to assume that 
the prediction problem is not only characterized by smoothness with respect 
to a given structure, but also by sparsity: in the phoneme classification ex- 
ample, visually only the first hundred frequencies seem to carry information 
relevant to the prediction problem. A similar rationale applies to the second 
example, where the arc of the numeral eight in the lower half of the picture is 
the eminent characteristic that admits a distinction from the numeral nine. 

2.2. Gauss-Markov random fields. Given a large, but structured set of 
features, its structure can be exploited to cope with high dimensionality in 
regression estimation. The estimands {/3j}j =1 form a finite set such that 
their prior dependence structure can conveniently be described by means of 
a graph Q = (V,E), V = {/Si, . . . ,/3 p }, E C V x V. We exclude loops, that 
is, (f3j,(3j) £ E for all j. The edges may additionally be weighted by a func- 
tion w.E—tM., w((/3j, f3j>)) = w((Pf , f3j)) for all edges in E. We will use the 
notation fij ~ (3ji to express that j3j and fij> are connected by an edge in 
Q. The weight function can be extended to a function on V x V by setting 
w((p j ,p j ,))=w((p j ,,p j )) = if (/?,-, /V) i E. 

The graph is interpreted in terms of the Gauss-Markov random fields 
[Besag (1974); Rue and Held (2001)]. In our setup, the pairwise Markov 
property reads 



(6) 
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Fig. 1. Phoneme data [Hastie, Buja and Tibshirani (1995)]. The upper panel shows sev- 
eral thousand log-periodogramms of the speech frames for the phonemes "aa" (as occuring 
in "dark") and "ao" (as occuring in "water"). The classwise means are given by thick lines. 
We use linear logistic regression to predict the phoneme given a log-periodogramm. The 
lower panel depicts the resulting coefficients when using the lasso (left panel), a first- order 
difference penalty (right panel), and a combination thereof, which we term "structured 
elastic net" (middle panel). 



with _LL denoting conditional independence. Property (6) is conformed to 
the following choice for the precision matrix A = (Jjj')i<j j'<p- 

r v 

^K(^-,^))|, if i = j', 

k=l 

which is singular in general. If sign{io((/3j,/3j/))} > for all (f3j,fif) in E, 
then A as given in equation (7) is known as the combinatorial graph Lapla- 
cian in the spectral graph theory [Chung (1997)]. It is straightforward to 
verify the following properties: 



(7) 
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Fig. 2. Handwritten digit recognition data set [Le Cun et al. (1989)]. One observation 
is given by a greyscale image composed of 16 x 16 pixels. The upper panel shows the 
contour of the pixel-wise means for the numerals "8" and "9. " We use a training set of 
1500 observations of eights and nines as input for linear logistic regression. The lower 
panel depicts the coefficient surfaces for the lasso (left panel), a discrete Laplacian penalty 
according to the grid structure (right panel), and a combination, the structured elastic net 
(middle panel). 



(8) /3 T A/3= Y, k(/3i,/3 J OI(/3 j -sign{ U ;((/3 i ,/3 i O)}/3 J 2 >0, 

where the sum is over all distinct edges in Q, and "distinct" is understood 
with respect to the relation (/%,/%') = ((3j*,f3j) for all j,j'. 
• If Q is connected and sign{u;((/3j, f3j*))} > for all (/3j,/3j/) in E, the null 
space of A is spanned by the vector of ones 1. 

While we have started in full generality, the choice w(((3j, /%')) £ {0,1} for 
all j,j' will frequently be the standard choice in practice. In this case, the 
quadratic form captures local fluctuations of (5 w.r.t. Q. As a simple example, 
one may take Q as the path on p vertices so that expression (8) equals the 
summed squared forward differences 

(9) - &-i) 2 = \\ B PW 2 = /3 t d t da 

i=2 
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Fig. 3. A collection of some graphs. A path and a grid (left panel), a rooted tree (middle 
panel), and an irregular graph describing a part of the so-called MAPK signaling pathway 
(right panel). 

where D is defined in equation (2). More complex graphical structures can be 
generated from simple ones using the notion of Cartesian products of graphs 
[Chung (1997), page 37] . For instance (as displayed in the left panel of Figure 
3), the Cartesian product of a p-path and a p'-path equals a p x p' regular 
grid, in which case the standard choice of A is seen to be a discretization 
of the Laplacian A acting on functions defined on M 2 . Regularizers built up 
from discrete differences have already seen frequent use in high-dimensional 
regression estimation. Examples comprise penalized discriminant analysis 
[Hastie, Buja and Tibshirani (1995)] and spline smoothing [Eilers and Marx 
(1996)]. 

2.3. Connection to manifold regularization. As pointed out by one of 
the referees, regularizers of the form (8) are applicable to a variety of learn- 
ing problems in which data are supposed to be generated according to a 
probability measure supported on a compact, smooth manifold M C W. 
The canonical regularization operator acting on smooth functions on M is 
the Laplace-Beltrami operator Am [Rosenberg (1997)], which generalizes 
the Laplacian for Euclidean domains. As suggested, for example, in Belkin, 
Niyogi and Sindwhani (2006), given a set of data points in IR P , a discrete 
proxy for a potential manifold structure can be obtained by computing a 
(possibly weighted) neighborhood graph of the points, and, in turn, a proxy 
for Am is obtained by a discrete Laplacian of the form (7) resulting from 
the neighborhood graph. 

Relating these ideas to our framework, one might think of settings where 
each of the {xj}" =1 represents a collection of p points sampled on a compact, 
smooth manifold M . This is a natural extension of the introductory exam- 
ples in Section 2.1, where the corresponding M would be given by an interval 
and a rectangle, respectively. Assuming a linear relationship between scalar 
responses {yOiLi an d the predictors {xj}™ =1 , we expect the corresponding 
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coefficient vector to be both sparse and smooth with respect to the mani- 
fold structure. Without going into detail, the approach might be useful for 
predictors with geographical information. The idea is illustrated in Figure 4 
where M is chosen as a sphere embedded in M 3 . 

3. Properties. 

3.1. Bayesian and geometric interpretation. In the setup of Section 1, 
consider the regularizer 

0(/3) = A 1 ||/3|| 1 + A 2 /3 T A/3, Ai,A 2 >0. 

It has a nice Bayesian interpretation when the loss function L is of the form 

(10) L(y, /(x; /3 , &)) = _1 (6(/(x)) - y/(x)) + c(y, </>), 

that is, the loss function equals the negative log-likelihood of a generalized 
linear model in canonical parametrization, which will primarily be studied 
in this paper. Models of this class are characterized by [cf. McCullagh and 
Nelder (1989)] 

Y |X = x ~ simple exponential family, 




Fig. 4. A manifold setting suitable to our regularizer. The black dots represent points 
at which the random variables Xj,j = 1, . . . ,p, are realized, and the spikes normal to the 
surface indicate the size of the corresponding /3*,j = l,...,p. Except for the two groups 
highlighted in the left and right panel, respectively, the coefficients equal zero. The dashed 
lines represent the neighborhood graph obtained by connecting each dot with its four nearest 
neighbors with respect to the geodesic distance on the sphere. 
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Fig. 5. Level sets {(/3 1 ,f3 2 ) ■ Ai(|/?i + |/3 2 |) + A 2 (/3i - /3 2 ) 2 = 1} (2e/S pane/, so/id Zinesj 
and {(/3i, /3 2 ) : Ai ( |/3i | + |/3 2 |) + A 2 (/3i + /3 2 ) 2 = 1} (7e/t panel, dashed lines) of the structured 
elastic net regularizer and {(/3i,/3 2 ) : Ai(|/3i| + |/3 2 |) + A 2 |/3i — /3 2 | = 1} of the fused lasso. 



The form (10) is versatile, including classical linear regression with Gaus- 
sian errors, logistic regression for classification, and Poisson regression for 
count data. Given a loss from the class (10), the regularizer 0(/3) can be 
interpreted as the combined Laplace (double exponential)-Gaussian prior 
p((3) oc exp(— Q((3)), for which the structured elastic net estimator (5), pro- 
vided p{fio) 1 ? is the maximum posterior (MAP) estimator given the sam- 
ple S. It is instructive to consider two predictors, that is, (3 = (/3i,/32) T . 
Figure 5 gives a geometric interpretation for the basic choices 



corresponding to positive- and negative prior correlation, respectively. 

The contour lines of the structured elastic net penalty contain elements of 
a diamond and an ellipsoid. The higher A2 in relation to Ai, the ellipsoidal 
part becomes more narrower and more stretched. The sign of the off-diagonal 
element of A determines the orientation of the ellipsoidal part. 

3.2. A grouping property. For the elastic net, Zou and Hastie (2005) pro- 
vided an upper bound on the absolute distances |/3| Iastlc nct - plastic net | ^ ^ y _ 
l,...,p, in terms of the sample correlations, to which Zou and Hastie re- 
ferred to as "grouping property." We provide similar bounds here. For what 
follows, let S be a sample as in Section 1. We introduce a design matrix 



var[y|X = x]=^6(/(x)). 




and 
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X = (xij)\<i< n and denote by Xj = (x±j, . . . ,x n j) the realizations of pre- 

i<i<P 

dictor j in S, and the response vector is defined by y = (yi, . . . ,y n ) T ■ For 
the remainder of this section, we assume that the responses are centered and 
that the predictors are centered and standardized to unit Euclidean length 
w.r.t. the sample S, that is, 

n n n 

(12) ^2 yi = ^2 = °> ^2 x % = 1 > j = i >---,p- 

i=i i=i i=i 

Proposition 1. Letting p = 2, let the loss function be of the form (10), 
let p = X 1 r X2 denote the sample correlation of Xi and X2, and let A = 
|g I), s€{-l,l}. If-afafoXl, then 

\A + sA\<^V^ + sp)\\y\\. 

In particular, in the setting^ of Proposition 1, we have the implication that 
if Xi = — sX-2, then /3i = —s@2- 

3.3. Decorr elation. Let us now consider the important special case 

L(y,/(x;/3)) = (y-x T / 3) 2 , 

which corresponds to classical linear regression. The constant term /3q is 
omitted, since we work with centered data. The structured elastic net esti- 
mator can then be written as 

3 = argmin -2y T X/3 + (3 T [C + A 2 A]/3 + Xi\\(3\\i, C = X T X, 
/3 

(13) 

= argmin-2y T X/3 + (3 T C(3 + Ai||/3||i, C = X T X + A 2 A. 
/3 

Note that for standardized predictors, C equals the matrix of sample cor- 
relations pjji = XjXj/, j, j' = 1, . . . ,p. With a large number of predictors or 
elements pjj' with large C is known to yield severely unstable ordinary 

least squares (ols) estimates /3? ls , j = 1, . . . ,p. If the two underlying random 
variables Xj and Xji are highly positively correlated, this will likely trans- 
late to high sample correlations of Xj and Xj/ , which in turn yield a strongly 
negative correlation between /3? ls and /3?, ls and, as a consequence, high vari- 
ances var[/3° ls ] and var[/3°, ls ]. In the prevalence of high correlations, perfor- 
mance of the lasso may degrade as well. For example, Donoho, Elad and 
Temlyakov (2006) showed that the lower the mutual coherence maxjj/ 

the more stable is lasso estimation. The modified matrix C can be written as 
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C = V][ 2 K A V][ 2 , V A = diag(l + A 2 2X1 \hk\,- • • , 1 + A 2 ELi M)> and 
the modified correlation matrix Ra has entries 

Pjj/ + \2ljf . ., 

PA,jf = —j= ■ 1 J, J =l,...,p. 

y 1 + ELi M V 1 + ELi \h'k\ 

In light of Section 2, the entries of Ra combine sample- and prior correla- 
tions. Decorrelation occurs if pjji ~ —\2ljji. 

4. Consistency. The asymptotic analysis presented in this section closely 
follows the ideas of Knight and Fu (2000) and Zou (2006). Both have studied 
asymptotics for the lasso in linear regression for a fixed number of predictors 
under conditions ensuring y^n-consistency and asymptotic normality of the 
ordinary least squares estimator. Knight and Fu (2000) proved that the lasso 
estimator ( (3 lasso is -^/n-consistent for the true coefficient vector f3* provided 
A™ = 0(y/n). Zou (2006) has shown that while this choice of A" provides 
the optimal rate for estimation, it leads to inconsistent feature selection. 
Define the active set as A = {j : f3* 7^ 0} and A c = {1, . . . ,p] \ A and let 5 

be an estimation procedure producing an estimate (3 s . Then 5 is said to be 
selection consistent if 

lim P0l n ± 0) = 1 for j G A, 
lim P(j9? = 0) = 1 for j e A c , 

n— >oo •" 

where here and in the following, the sub- or superscript n indicates that the 
corresponding quantity depends on the sample size n. Moreover, Zou (2006) 
and Zhao and Yu (2006) have shown that if A" = o(n) and \i/y/n—> 00, 
the lasso has to satisfy a nontrivial condition, the so-called "ir represent able 
condition," to be selection consistent. Zou (2006) proposed the adaptive 
lasso, a two-step estimation procedure, to fix this deficiency. In the following, 
these results will be adapted to the presence of a second quadratic penalty 
term. 

Theorem 1. Define 

% = argmin ||y n - X n /3|| 2 + X[\\p\\ x + X^/3 T A(3. 

Assume that A^/y^n — > > and X^j^fn^f A 2 > 0. Consider the random 
function 

V(u) = -2u T w + u T Cu 

V 

+ A? u i sign(^)I(/3* + 0) + \ Uj \I((3* = 0) 
3=1 

+ 2A^u T A/3*, w~iV(0,CT 2 C). 
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Then, under conditions (C.1)-(C.3) in the online supplement, y/n(/3 n — 
(3*) —> argminl/(u). 

Theorem 1 is analogous to Theorem 2 in Knight and Fu (2000) and es- 
tablishes -^/re-consistency of (3 n , provided A" and A2 are 0(y/n). Theorem 1 
admits a straightforward extension to the class of generalized linear models 
[cf. equation (11)]. Let the true model be defined by 

E[Y\X = x] = &'(/(x; £*)), /(x) = x T /3*. 

For the sake of a clearer presentation, we assume that /3g = 0. We study the 
estimator 

n 

(14) X = argmin2^- 1 V6(/(x i; /3)) - yJ(x,;/3) + \<{\\Ph + A£/3 T A/3. 

Theorem 2. For the estimator (14), let A^/a/xi — >■ A? > and X^/y/n^- 
A2 > 0. Consider the random function 

W(u) = -2u T w + u T Xu 

v 

+ A? U J sign(/3*)/(/3* + 0) + \ Uj \I(/3* = 0) 

i=y 

+ 2A 2 l u T A/3*, w~iV(0,Z). 

Then under conditions (G.l) and (G.2) in the online supplement, y/n{j3 n — 
(3*) -> argmin jy(u). 

Now let us turn to the question of selection consistency. In the setup of 
Theorem 1, if A" and AJ? both are 0(y/n), then, using arguments similar to 
those in Knight and Fu (2000) and Zou (2006), (3 n is shown not to be selec- 
tion consistent. Selection consistency can be achieved if one lets A", A2 grow 
more strongly and if the quantities C, A, and (3* jointly fulfill a nontrivial 
condition, which can be seen as analog to the irrepresentable condition of 
the lasso [Zou (2006), Zhao and Yu (2006)]. 

Theorem 3. In the situation of Theorem 1, let A™ /re — > 0, A^/^/re — > 00, 
X% / A" — > R, < R < 00 and consider the partitioning scheme 

(15) 

A= (A A A AA ,y 
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so that here and in the following, the subscripts A and A c refer to active and 
inactive set, respectively. Then, if selection consistency holds, the following 
condition must be fulfilled: there exists a sign vector s A such that 

\-C AcA C- A \s A + 2RA A (3* A ) + 2RA A c A (3* A \ < 1, 

where the inequality is interpreted componentwise. 

While this condition is interesting from a theoretical point of view, it is 
impossible to check in practice, since (3* A is unknown. 

Selection consistency can be achieved by a two-step estimation strat- 
egy introduced in Zou (2006) under the name adaptive lasso, which re- 
places £ 1 -regularization uniform in f3j, j = 1, . . . ,p, by a weighted variant 
= Y^j=i > where the weights {ujj}^ =1 are determined adaptively 
as a function of an "initial estimator" (3 : 

(i6) ^-H^r 7 , 7 >0,i=l,...,p. 

In terms of selection consistency, this strategy turns out to be favorable for 
our proposal, too. 

Theorem 4. In the situation of Theorem 1, define 

p 

^adaptive = argmin ^ _ + A?y>,-|^| + A^/3 T A/3, 

f> 

where the weights are as in equation (16), and suppose that the initial esti- 
mator satisfies 

r n (P n - (3*) = Op(l), r n y oo as n->oo. 
Furthermore, suppose that 

r^A>" 1/2 -> oo, A>~ 1/2 -> 0, A^n" 1 / 2 -»• X° 2 > 
as n — > oo. Then: 

(1) V^0fT ivC -Pa) 4iV(-A°C^A^ ( C^), 

(2) lin Woo P(3^ tive = 0) = l. 

Theorem 4 implies that the adaptive structured elastic net ^g ada P tlve i s an 
oracle estimation procedure [Fan and Li (2001)] if the bias term in (1) van- 
ishes, which is the case if /3 A resides in the null space of A A . Interestingly, if 
A equals the combinatorial graph Laplacian (cf. Section 2.2), this happens if 
and only if (3* A has constant entries and A specifies a connected component 
in the underlying graph. 
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Concerning the choice of the initial estimator, the ridge estimator has 
worked well for us in practice, provided the ridge parameter is chosen ap- 
propriately. While 7 may be treated as a tuning parameter, we have set 7 
equal to 1 in all our data analyses. Last, we remark that while Theorem 4 
applies to linear regression, it can be extended to hold for generalized linear 
models, similarly as we have extended Theorem 1 to Theorem 2. 

5. Computation. This section discusses aspects concerning computation 
and model selection for the structured elastic net estimator when the loss 
function is the negative log-likelihood of a generalized linear model (10). 

5.1. Data augmentation. From the discussions in Section 3.3, it follows 
that the structured elastic net for squared loss, assuming centered data, can 
be recast as the lasso on augmented data 

X=(*) , y=( y ) , A = QQ, 

\ A 2 H/( n+p )xp \ U /(n+p)xl 

and, hence, algorithms available for computing the lasso, notably LARS 
[Efron et al. (2004)], may be applied, which computes for fixed A2 and vary- 
ing Ai the piecewise linear solution path /3(Ai; A2). This approach is parallel 
to that proposed by Zou and Hastie (2005) for the elastic net. In addition, 
the augmented data representation is helpful when addressing uniqueness of 

1 /2 

the structured elastic net in the p>n setting: if rank(X) + rank(A 2 Q) > p 

1/2 

and the rows of X combined with the rows of A 2 Q form a linearly inde- 
pendent set, C as defined in equation (13) is of full rank and, hence, the 
structured elastic net is unique. Moreover, this shows that even for p> n, 
in principle, all features can be selected. 

In order to fit arbitrary regularized generalized linear models, the aug- 
mented data representation has to be modified. Without regularization, es- 
timators in generalized linear models are obtained by iteratively computing 



weig 


;hted least 


squares estimators: 






[p(k+i) J 


= ([1 x] T w( fc )[i 


x]r x [i x] T w( fc ) z w, 






= fW + [ W Wp 1 (y- 


V fc) ), 


(17) 


f(fc) 


_ (f (k) .(fcUT 
Ul > • • • > Jn ) 1 


/ i W = i§S* ) +xT3W | i = l,...,„, 






= (/i^ \ . . . , /4i ^) ; 


/4 fe) = ^(jf ),* = !,..., n, 






= diag(4 fc) ,...,u4 fc )) 


, «,«=0-V(//* ) ) > f = l > ...,n 
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Note that the design matrix additionally includes a constant term 1 . Turning 
back to the structured elastic net, an adaptation of the augmented data 
approach iteratively determines 

(K)=^S^ , (^- sf (?)) ,+wl " 

with 

w[ = wf" , i = 1, . . . , n, as in equation (17) , 
■55; =1, i = (n+l),...,(n + p), 
zf^ = i = l,...,n, as in equation (17), 

if } =0, i = (n + l),...,(n + p), 
Xi = (l x7) T , i = l,...,n, 
Xi=(0 VhqJ) T , i = (n + l),...,{n + p), 

with qj" denoting the ith row of Q. 

Alternatives to augmented data representation include cyclical coordinate 
descent in the spirit of Friedman et al. (2007) and a direct modification of 
Goeman's algorithm [Goeman (2007)]. Descriptions can be a found in the 
full technical report underlying this article [Slawski, zu Castell and Tutz 
(2009), available online]. 

6. Data analysis. 

6.1. One- dimensional signal regression. In one-dimensional signal regres- 
sion, as described, for example, in Frank and Friedman (1993), one aims at 
the prediction of a response given a sampled signal x T = (x(t))J =1 , where 
the indices t = 1, . . . , T, refer to different ordered sampling points. For a sam- 
ple S = {({xi(t)}J =1 ,yi), ({x n (t)}]: =l ,y n )} of pairs consisting of sampled 
signals and responses, we consider prediction models of the form 



,n. 



t=i 



6.1.1. Simulation study. Similarly to Tutz and Gertheiss (2010), we sim- 
ulate signals x(t),t = 1, . . . , T, T = 100, according to 

5 

{x{t)} ~ h sin(t7r(5 - b k )/50 - m k ) + r(t), 

k=l 

{b k }~U(0,5), {m k }~U(0,2Tr), {r(t)}~JV(0,0.25), 
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with U(a,b) denoting the uniform distribution on the interval (a, b). For 
the coefficient function f3*(t),t = 1, . . . ,T, we examine two cases. In the first 
case, referred to as the "bump setting," we use 

( -{(30 - t) 2 + 100}/200, t = 21, . . . , 39, 
|8*(*) = < {(70-t) 2 -100}/200, i = 61,...,80, 
[ 0, otherwise. 

In the second case, referred to as the "block setting," 

(3* = (0, . 0, 0.5, ■ . . ,0.5 ,1, ■ ■■ , 1, 0.5, ■ .. ,0.5 , 0.25, ■ . . ,0.25 ,0, . •■,0) T - 

20 times 10 times 10 times 10 times 10 times 40 times 

The form of the signals and coefficient functions are displayed in Figure 6. 
For both settings, data are simulated according to 

T 

y = J2x(t)P*(t) + e, e~iV(0,5). 
t=i 

For each out of 50 iterations, we simulate i = 1,...,500 i.i.d. realizations 
and divide them into three parts: a training set of size 200, a validation 
set of size 100, and a test set of size 200. Hyperparameters of the methods 
listed below are optimized by means of the validation set. As performance 
measures, we compute the absolute distance L l ((3,j3) = \\(3 — of true- 
and estimated coefficients and the mean squared prediction error on the test 
set. For methods with built-in feature selection, we additionally evaluate the 
goodness of selection in terms of sensitivity and specificity. For each of the 
two setups, the simulation is repeated 50 times. The following methods are 
compared: ridge regression, generalized ridge regression with a first difference 
penalty, P-splines according to Eilers and Marx (1999), lasso, fused lasso, 
elastic net, structured elastic net with a first difference penalty, adaptive 




Fig. 6. The setting of the simulation study. A collection of five signals (left panel), 
the coefficient functions for "bump" — (middle panel) and "block" setting (right panel), 
respectively. 
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structured elastic net, where the weights {uj(t)} are chosen according to the 
ridge estimator of the same iteration as oo(t) = l/|/3 ridgc (t)|. 

Performance measures are averaged over 50 iterations and displayed in 
Table 1 (bump setting) and Table 2 (block setting), respectively. 

For the bump setting, Figure 7 shows that the double-regularized proce- 
dures employing decorrelation clearly outperform a visibly unstable lasso. 
Due to a favorable signal-to-noise ratio, even simplistic approaches such as 
ridge- or generalized ridge regression show competitive performance with 
respect to prediction of future observations. In pure numbers, the estima- 
tion of /3*(t) is satisfactory as well. However, the lack of sparsity results into 
"noise fitting" for those parts where f3*(t) is zero. For the two settings ex- 
amined here, the P-spline approach does not improve over generalized ridge 
regression, because the two coefficient functions are not overly smooth. The 
elastic net considerably improves over the lasso, but it lacks smoothness. 
Its numerical inferiority to ridge regression results from double shrinkage 
as discussed in Zou and Hastie (2005). The performance of the structured 
elastic net is not fully satisfactory. In particular, at the changepoints from 
zero- to nonzero parts, there is a tendency to widen unnecessarily the sup- 
port of the nonzero sections. This shortcoming is removed by the adaptive 
structured elastic net, thereby confirming the theoretical result concerning 



Table 1 

Results for the bump setting, averaged over 50 simulations 



Method 




PE 


Sensitivity 


Specificity 


Ridge 


0.249 


5.35 








(5.9 x 1(T 4 ) 


(0.078) 






G. ridge 


0.238 


5.32 








(9.9 x 10" 4 ) 


(0.076) 






P-spline 


0.241 


5.30 








(16.0 x 10" 4 ) 


(0.077) 






Lasso 


0.271 


5.72 


0.62 


0.65 




(23.9 x 10~ 4 ) 


(0.079) 


(8.9 x 10" 3 ) 


(0.016) 


Fused lasso 


0.235 


5.30 


0.96 


0.51 




(7.2 x 10~ 4 ) 


(0.075) 


(5.5 x 10" 3 ) 


(0.010) 


Enet 


0.246 


5.46 


0.93 


0.69 




(29.9 x 10~ 4 ) 


(0.081) 


(0.013) 


(0.032) 


S.enet 


0.232 


5.30 


0.98 


0.59 




(7.6 x KT 4 ) 


(0.078) 


(7.8 x 10" 3 ) 


(0.029) 


Ada. s. enet 


0.232 


5.25 


0.91 


0.82 




(15.0 x 10" 4 ) 


(0.075) 


(21.0 x 10" 3 ) 


(0.020) 



For annotation, see Table 2. 
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Table 2 

Results for the block setting, averaged over 50 simulations. We make use of the following 
abbreviations: "PE" for "mean squared prediction error," "g. ridge" for "generalized 
ridge," "enet" for "elastic net," "s.enet" for "structured elastic net," and "ada.s.enet" 
for "adaptive structured elastic net. " Standard errors are given in parentheses. For each 
column, the best performance is emphasized in boldface 



Method 


i 1 (3,/3*) 


PE 


Sensitivity 


Specificity 


Ridge 


0.082 


5.41 








(3.4 x 10~ 3 ) 


(0.080) 






G. ridge 


0.064 


5.35 








(1.9 x 10~ 3 ) 


(0.078) 






P-spline 


0.065 


5.34 








(1.9 x 10" 3 ) 


(0.077) 






Lasso 


0.207 


6.12 


0.73 


0.62 




(3.6 x 10" 3 ) 


(0.089) 


(7.5 x 10" 3 ) 


(0.014) 


Fused lasso 


0.058 


5.34 


0.99 


0.51 




(1.9 x 10~ 3 ) 


(0.076) 


(7x 10~ 4 ) 


(0.009) 


Enet 


0.094 


5.47 


0.95 


0.73 




(5.0 x 10" 3 ) 


(0.072) 


(6.4 x 10 -3 ) 


(0.083) 


S.enet 


0.070 


5.38 


0.99 


0.60 




(5.0 x 10~ 3 ) 


(0.080) 


(3.3 x 10" 3 ) 


(0.027) 


Ada.s.enet 


0.061 


5.32 


0.97 


0.83 




(3.2 x 10~ 3 ) 


(0.69) 


(8.0 x 10" 3 ) 


(0.018) 



selection consistency. This quality seems to be supported by the eminent 
performance with respect to sensitivity and specificity. The success of the 
adaptive strategy is also founded on the good performance of the ridge es- 
timator providing the component-specific weights u)(t). The block setting 
is actually tailored to the fused lasso, whose output are piecewise constant 
coefficient functions. Nevertheless, it is not optimal, as the shrinkage of the 
^-penalty acts on all coefficients, including those different from zero. As 
a result, the fused lasso is outperformed by the adaptive structured elas- 
tic net with respect to prediction, though the structure part is seen to be 
not fully appropriate in the block setting (cf. Figure 8). As opposed to the 
bump setting, fitting the block function seems to be much more difficult to 
accomplish in general. 

6.1.2. Accelerometer data. The "Sylvia Lawry Centre for Multiple Scle- 
rosis Research e.V.," Munich, kindly provided us with two accelerometer 
records of two healthy female persons, aged between 20 and 30. They were 
equipped with a belt containing an accelerometer integrated into the belt 
buckle before walking several minutes on a flat surface at a moderate speed. 
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t t 

Fig. 7. Estimated coefficient functions for the bump setting. The pointwise median curve 
over 50 iterations is represented by a solid line, pointwise 0.05- and 0.95-quantiles are 
drawn in dashed lines. 

The output are triaxial (vertical, horizontal, lateral) acceleration measure- 
ments at roughly 25,000 sampling points per person. Following Daumer et al. 
(2007), human gait, if defined as the temporal evolution of three-dimensional 
accelerations of the center of mass of the body, is supposed to be a quasi- 
periodic process. Every period defines one gait cycle/double step, which 
starts with the heel strike and ends with the heel strike of the same foot. 
A single step ends with the heel strike of the other foot. Therefore, a dou- 
ble step can be seen as a natural unit. As a consequence, decomposition 
of the raw signal into pieces, each representing one double step, is an inte- 
gral part of data preprocessing, not described in further detail here. Overall, 
we extract i = 1, . . . , n = 406 double steps, 242 from person B (y = 0) and 
164 from person A (y = 1), ending up with a sample {(xj,?/j)}™ =1 , where 
each Xi = (xi(t)),t = 1, . . . ,T = 102, stores the observed vertical acceleration 
within double step i,i = 1, . . . ,n. For simplicity, we neglect the dependence 
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Fig. 8. Estimated coefficient functions for the block setting. 



of consecutive double steps within the same person and treat them as inde- 
pendent realizations. Horizontal- and lateral acceleration are not considered, 
since they do not carry information relevant to our prediction problem. We 
aim at the prediction of the person (A or B) given a double step pattern, 
and additionally at the detection of parts of the signal apt for discriminating 
between the two persons. We randomly divide the complete sample into a 
learning set of size 300 and a test set of size 106, and subsequently carry 
out logistic regression on the training set, using the structured elastic net 
with a squared first difference penalty. Hyperparameters are determined by 
ten-fold cross-validation, and the resulting logistic regression model is used 
to obtain predictions for the test set. The fused lasso with the hinge loss 
of support vector machines is used as competitor. A collection of results is 
assembled in Figure 9 and Table 3, from which one concludes that classifica- 
tion is an easy task, since (nearly) perfect misclassification rates on the test 
set are achieved. Concerning feature selection, the results of the structured 
elastic net are comparable to those of the fused lasso. 



FEATURE SELECTION GUIDED BY STRUCTURAL INFORMATION 21 



£ ? 



q 













A 




\ 




S 


a 


20 


JO 


60 60 

t 


100 

















20 


40 


60 60 


100 



Fig. 9. Coefficient functions for structured elastic net-regularized logistic regression (left 
panel) and the fused lasso support vector machine (right panel). Within each panel, the 
upper panel displays the overlayed double step patterns of the complete sample (406 double 
steps). The colors of the curves refer to the two persons. 



6.2. Surface fitting. Figure 10 depicts the surface to be fitted on a 20 x 20 
grid. The surface can be represented by a discrete function (3*(t,u),t,u = 
1, . . . , 20. It consists of three nonoverlapping truncated Gaussians of different 
shape and one plateau function. We have 

P*(t,u) = B(t,u) + Gx(t,u) + G 2 (t,u) + G 3 (t,u), 

(18) B(t,u) = h{t G {10, 11, 12}, u G {3,4}), 



G\{t,u) = max< 0,exp I — (t — 3u 



3 
0.25 



t-3 
u — 8 



0.2 



Table 3 

Results of step classification for the fused lasso support vector machine and structured 

elastic net-regularized logistic regression. The bound imposed on the 1-norm of (3 
corresponding to Ai is denoted by ti, while ti corresponding to A2 denotes the bound 
imposed on the absolute differences X/t=a l<^W — P(t ~ 1)1 f or f use d lasso and the 
squared differences ^2f =2 (P{t) — fi{t — l)) 2 for the structured elastic net, respectively. 
Concerning the degrees of freedom of the two procedures, we take the number of nonzero 
blocks for the fused lasso. For the structured elastic net, we make use of a heuristic due 
to Tibshirani (1996) that rewrites the lasso fit as the weighted ridge fit; see Slawski, zu 
Castell and Tutz (2009) for details 



tl 


t 2 


Test error Degrees of freedom 


# nonzero coefficients 






Fused lasso 




2.5 


0.5 


9 


40 






Structured elastic net 




23 


2 


1 7.85 


61 
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true surface 




Fig. 10. Contours of the surface according to equation (18). 
G<z(t, u) = max |o, exp ^— (t — 7u — 17) 

x( 75 O^X^))--}. 

Gs(t, u) = max |o, exp ^— (t — 15u — 14) 

/ 0.5 -0.25\ /i-15\\ nn \ 
X V-0-25 0.5 ){u-u))- - 2 \- 

Similarly to the simulation study in Section 6.1.1, we simulate a noisy 
version of the surface according to 

y (t,u) = l3*(t,u)+£(t,u), {e(t,u)} i- ~'JV(0,0.25 2 ), t,u = 1, . . . ,20. 

For each of the 50 runs, we simulate two instances of y(t, u). The first one is 
used for training and the second one for hyperparameter tuning. The mean 
squared error for estimating (3* is computed and averaged over 50 runs. 
Results are summarized in Figure 11 and Table 4. We compare ridge, gener- 
alized ridge with a difference penalty according to the grid structure, lasso, 
fused lasso with a total variation penalty along the grid, structured- and 
adaptive structured elastic net with the same difference penalty as for gen- 
eralized ridge. The elastic net coincides — up to a constant scaling factor — 
with the lasso/soft thresholding in the orthogonal design case and is hence 
not considered. 

7. Discussion. The structured elastic net is proposed as a procedure for 
coefficient selection and smoothing. We have established a general notion 
of structured features, for which the structured elastic net is able to take 
advantage of prior knowledge as opposed to the lasso and the elastic net, 
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Fig. 11. Contours of the estimated surfaces for three selected methods, averaged pointwise 
over 50 runs. 



which are both purely data-driven. The structured elastic net may also be 
regarded as a computationally more convenient alternative to the fused lasso. 
Conceptually, generalizing the fused lasso by computing the total variation 
of the coefficients along a graph is straightforward. However, due to the 
nondifferentiability of the structure part of the fused lasso, computation 
may be intractable even for moderately sized graphs. 

Turning to the drawbacks of the structured elastic net, it is obvious 
that model selection and computation of standard errors and, in turn, the 
quantification of uncertainty, are notoriously difficult. A Bayesian approach 

Table 4 

Results of the simulation, averaged over 50 iterations ( standard errors in 
parentheses). The columns labeled B, Gi, G2, G3, and "zero" contain the 

mean prediction error for the corresponding region of the surface. The 
abbreviations equal those in Table 2. The prediction error has been rescaled 

by 100 



Method 


PE 


B 


Gi 


G 2 


G 3 


Zero 


Ridge 


1.20 
(0.01) 


0.31 


0.28 


0.20 


0.34 


0.07 


G. ridge 


1.17 
(0.04) 


0.18 


0.20 


0.14 


0.21 


0.44 


Lasso 


1.31 
(0.01) 


0.37 


0.32 


0.22 


0.39 


0.01 


Fused lasso 


0.67 
(0.02) 


0.14 


0.12 


0.08 


0.15 


0.18 


S.enet 


0.88 
(0.02) 


0.22 


0.16 


0.12 


0.23 


0.18 


Ada.s.enet 


0.56 

(0.02 ) 


0.15 


0.09 


0.08 


0.18 


0.06 
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promises to be superior in this regard. The lasso can be treated within a 
Bayesian inference framework [Park and Casella (2008)], while the quadratic 
part of the structured elastic net regularizer is already motivated from a 
Bayesian perspective in this paper. 

With regard to possible directions of future research, we will consider 
studying the structured elastic net in combination with other loss functions, 
for example, the hinge loss of support vector machines or the check loss for 
quantile regression. The asymptotic analysis in this paper is basic in the 
sense that it is bound to strong assumptions, and the role of the structure 
part of the regularizer and its interplay with the true coefficient vector is not 
well understood yet, leaving some room for more profound investigations. 
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SUPPLEMENTARY MATERIAL 

Supplement to "Feature Selection guided by Structural Information" 

(DOI: 10.1214/09-AOAS302SUPP; .pdf). The supplement contains proof 
of all statements of the main article. 
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